Data Wra ngling Tidy Data - A foundation for wrangling in pandas 


. | F | m| a | Fi M|a_ Tidy data complements pandas’s vectorized M| x Es 
with pandas Cheat Sheet In a tidy operations. pandas will automatically preserve Í 
http://pandas.pyd ata.org data set: observations as you manipulate variables. No O- = 


; _ __ other format works as intuitively with pandas. 
Pandas API Reference Pandas User Guide Each variable is saved Each observation is M *k A 
in its own column saved in its own row 


Reshaping Data - Change layout, sorting, reindexing, renaming 


df.sort values('mpg' ) 
Order rows by values of a column (low to high). 


Creating DataFrames 


df.sort values('mpg’, ascending=False) 
Order rows by values of a column (high to low). 
df = pd.DataFrame( 
{"a" : [4, 5, 6], 
"b" : [7, 8, 9], 
we, : [10, 11; 1217, 
index = [1, 2, 3]) 
Specify values for each column. 


pd.melt(df) 
Gather columns into rows. 


df.rename(columns = {'y':'year'}) 


df .pivot(columns='var', values='val') 
Rename the columns of a DataFrame 


Spread rows into columns. 


| df.sort index() 


Sort the index of a DataFrame 


E — | | df.reset_index() 
df = pd.DataFrame( Reset index of DataFrame to row numbers, moving 
[[4, 7, 10], index to columns. 
te, 9, 1211, pd. concat([df1,df2]) e ea T ee df.drop(columns=['Length’, 'Height']) 
index=[1, 2, 3], Append rows of DataFrames ppend columns of DataFrames Drop columns from DataFrame 
columns=['a', 'b', 'c']) 
Specify values for each row. Subset Observations - rows Subset Variables - columns Subsets - rows and columns 
— _ = A — mE ë Usedf.loc[] anddf.iloc[] to select only 
TETE — m | 6 fon meena rows, only columns or both. 
—— N Use df.at[] anddf.iat[] to access a single 
df[df.Length > 7] df[['width?, 'length?’, 'species']] value by row and column. 
Extract rows that meet logical criteria. Select multiple columns with specific names. First index selects rows, second index columns. 
df.drop duplicates() df['width'] or df.width df.iloc[10:20] 
df = pd.DataFrame( Remove duplicate rows (only considers columns). Select single column with specific name. er 10-20 
{"a" : [4 ,5, 6], df.sample(frac=0.5) df.filter(regex='regex' ) df.iloc{:, [1 > 5]] 
“pb” g [7,. 8, Jla Randomly select fraction of rows. Select columns whose name matches heer tee estiene 1, 2and 5 (first 
"c" : [10, 11, 12]}, df.sample(n=10) Randomly select n rows. regular expression regex. salmat 6) P i 
index = pd.MultiIndex.from tuples( df.nlargest(n, 'value’) : df.loc[:, 'x2':'x4"] 
[Cd 1), Cea. 2), Select and order top n entries. Se i | i b andsa ncù 
Ce nane a a D) df.nsmallest(n, 'value') — elect a co umns PNNeenK an x une usive). 
T S selectand order otonmn entries query() allows Boolean expressions for filtering df.loc[df['a'] a 10, ['a s c']] 
df.head(n) rows. Select rows meeting logical condition, and only 
Srg select iren MWe df.query('Length > 7') the specific columns . 
Method Chaining df.tail(n) df.query('Length > 7 and Width < 8') df.iat[1, 2] Access single value by index 
Select last n rows. df.query('Name.str.startswith("abc")', df.at[4, 'A'] Access single value by label 
Most pandas methods return a DataFrame so that engine="python") 


another pandas method can be applied to the result. 
This improves readability of code. 


Logic in Python (and pandas) regex (Regular Expressions) Examples 


Group membership 'Length$' Matches strings ending with word 'Length 
Matches strings beginning with the word 'Sepal 


df = (pd.melt(df) 
. rename (columns={ 
'variable':'var', 
'value':'val'}) 
.query('val >= 200') 


"Ax[1-5]$' Matches strings beginning with 'x' and ending with 1,2,3,4,5 
"A(?!Species$).*' Matches strings except the string 'Species' 


Cheatsheet for pandas (http://pandas.pydata.org/ originally written by Irv Lustig, Princeton Consultants, inspired by Rstudio Data Wrangling Cheatsheet 


Summarize Data 


df['w'].value_counts() 
Count number of rows with each unique value of variable 
len(df) 
# of rows in DataFrame. 
df.shape 
Tuple of # of rows, # of columns in DataFrame. 
df[ 'w'].nunigue() 
# of distinct values in a column. 


df.describe() 
Basic descriptive and statistics for each column (or GroupBy). 
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pandas provides a large set of summary functions that operate on 
different kinds of pandas objects (DataFrame columns, Series, 
GroupBy, Expanding and Rolling (see below)) and produce single 
values for each of the groups. When applied to a DataFrame, the 
result is returned as a pandas Series for each column. Examples: 


sum() min() 
Sum values of each object. Minimum value in each object. 
count() max() 
Count non-NA/null values of Maximum value in each object. 
each object. mean() 
median() Mean value of each object. 
Median value of each object. var() 
quantile([@.25,0.75]) Variance of each object. 
Quantiles of each object. std() 
apply(function) Standard deviation of each 
Apply function to each object. object. 


Group Data 


df.groupby(by="col") 
Return a GroupBy object, grouped 
by values in column named "col". 


df.groupby(level="ind" ) 
Return a GroupBy object, grouped 
by values in index level named 
"ind". 


All of the summary functions listed above can be applied to a group. 
Additional GroupBy functions: 
size() 

Size of each group. 


agg (function) 
Aggregate group using function. 


aa Plotting ae 


df .expanding() 
Return an Expanding object allowing summary functions to be 


applied cumulatively. 

df.rolling(n) 
Return a Rolling object allowing summary functions to be 
applied to windows of length n. 


shift(1) 

rank (method='dense' ) 
rank(method='min' ) 
rank(pct=True) 


rank(method='first' ) 


Handling Missing Data 


df .dropna() 
Drop rows with any column having NA/null data. 


df .fillna(value) 
Replace all NA/null data with value. 


Make New Columns 


= a 
SSS ” SSS 


df.assign(Area=lambda df: df.Length*df.Height) 
Compute and append one or more new columns. 
df['Volume'] = df.Length*df.Height*df.Depth 
Add single column. 
pd.qcut(df.col, n, labels=False) 
Bin column into n buckets. 


E | || i 
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pandas provides a large set of vector functions that operate on all 
columns of a DataFrame or a single selected column (a pandas 
Series). These functions produce vectors of values for each of the 
columns, or a single Series for the individual Series. Examples: 
max (axis=1) min(axis=1) 

Element-wise max. Element-wise min. 
clip(lower=-10,upper=10@) abs() 
Trim values at input thresholds Absolute value. 


The examples below can also be applied to groups. In this case, the 
function is applied on a per-group basis, and the returned vectors 
are of the length of the original DataFrame. 


shift(-1) 
Copy with values lagged by 1. 
cumsum() 
Cumulative sum. 
cummax() 
Cumulative max. 
cummin() 
Cumulative min. 


cumprod() 
Cumulative product. 


Copy with values shifted by 1. 
Ranks with no gaps. 

Ranks. Ties get min rank. 

Ranks rescaled to interval [0, 1]. 


Ranks. Ties go to first value. 


df.plot.hist() 
Histogram for each column 


df.plot.scatter(x='w',y='h') 
Scatter chart using pairs of points 


Combine Data Sets 


adf 
xd | x2 
oF 
B 2 
c 3 


Standard Joins 


er pd.merge(adf, bdf, 


how='left', on='x1') 
B 2 F Join matching rows from bdf to adf. 
C 3 NaN 


EAEE pd.merge(adf, bdf, 
a 1.0 | how='right', on='x1') 
He 20 E on matching rows from adf to bdf. 


MD NaN TD 
ra pd.merge(adf, bdf, 


how="inner', on='x1") 
Bald LE] Join data. Retain only rows in both sets. 


, , 2 pd.merge(adf, bdf, 


how="outer', on='x1") 
à BEM Join data. Retain all values, all rows. 
C 3 NaN 


D NaN! T_ 


Filtering Joins 


adf [adf.x1.isin(bdf.x1) ] 
All rows in adf that have a match in bdf. 


x1 | x2 
A 1 
B 2 


xd | x2 


adf[~adf.x1.isin(bdf.x1) ] 
All rows in adf that do not have a match in bdf. 


Set-like Operations 


pd.merge(ydf, zdf) 
Rows that appear in both ydf and zdf 
(Intersection). 


pd.merge(ydf, zdf, how='outer' ) 
Rows that appear in either or both ydf and zdf 
(Union). 


pd.merge(ydf, zdf, how='outer', 
indicator=True) 
-guery('_merge == "left_only"') 
.drop(columns=[ '_merge' ]) 
Rows that appear in ydf but not zdf (Setdiff). 


Cheatsheet for pandas (http://pandas.pydata.org/) originally written by Irv Lustig, Princeton Consultants, inspired by Rstudio Data Wrangling Cheatsheet 


